PREDICCIÓN DE LOS SALARIOS TRIMESTRALES EN EL ESTADO DE NEW YORK¶
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
import matplotlib.pyplot as plt
import pmdarima as pm
import missingno as msno
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from statsmodels.tsa.seasonal import seasonal_decompose
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
DATASETS
Enlace para descargar los datasets: https://drive.google.com/drive/folders/1hYewiVm4w3-GMIWixnP6Lc5_5HtLJYIo?usp=sharing
El objetivo de este trabajo es construir un modelo predictivo de los salarios trimestrales en el estado de New York.
Este trabajo consta de un conjunto de datos principal 'census' y cuatro conjuntos de datos abiertos.
Los datos abiertos contienen información macroeconómica y geográfica y han sido agregados al trabajo con el propósito de mejorar el modelo predictivo.
CENSUS
El conjunto de datos principal de este trabajo procede del Quarterly Census of Employment and Wages program (QCEW), la principal fuente de datos del mercado laboral que existe en el estado de New York. Este conjunto de datos recopila las principales informaciones sobre empleo y salario a nivel estatal. Proporciona un censo virtual del 97% de los empleados y empleadores no agrícolas del estado, lo que lo hace una fuente de datos muy amplia y con esto precisa.
Debido a su naturaleza, el conjunto de datos principal se presta a ser cumplimentado con datos abiertos relativos a variables macroeconómicas.
GDP
El dataset 'GDP' contiene los datos trimestrales (en trillones de dólares) relativos al producto interno bruto en el estado de New York de 2000 a 2022. Los valores desde el año 2000 hasta finales de 2004, han sido estimados considerando el GDP anual del estado de New York.
INFLATION
El dataset de 'Inflation' contiene las tasas de inflación trimestral en EE.UU. desde el año 2000 hasta finales de 2019.
UNEMPLOYMENT
El dataset 'Unemployment' contiene los datos de desempleo detallados por mano de obra, empleados, desempleados y tasa de desempleo. Todos estos datos están detallados por año, més y área geográfica.
ÁREA
El dataset 'Area' contiene la información relativa a las subdivisiones administrativas de las áreas geográficas en el estado de New York.
County. Los condados son los pilares utilizados para construir áreas geográficas cada vez más grandes para las cuales se reportan estadísticas del mercado laboral. El estado de New York cuenta con 62 condados.
Metropolitan Statistical Area. Cada Área Estadística Metropolitana debe tener al menos un área urbana de 50.000 o más habitantes y consta de uno o más condados asociados. El dataset cuenta con 12 Áreas Estadísticas Metropolitanas.
Micropolitan Statistical Area. Cada Área Estadística Micropolitana debe tener al menos un área urbana de al menos 10.000 pero menos de 50.000 habitantes y consta de uno o más condados asociados. El estado de New York cuenta con 14 Áreas Estadísticas Micropolitanas.
Metropolitan Division. La División Metropolitana es una agrupación más pequeña de los condados que se cumple en caso un Área Estadística Metropolitana contenga un único núcleo con una población de 2,5 millones de habitantes. El estado de New York cuenta con 2 Divisiones Metropolitanas.
Workforce Investment Region. Es el organismo para el desarrollo de la fuerza laboral. El estado de New York con 33 juntas de desarrollo de la fuerza laboral local. Labor Market Area. Las Regiones del Mercado Laboral (LMR) corresponden a diez áreas geográficas del estado, que están definidas por el Departamento de Trabajo del Estado de Nueva York.
census = pd.read_csv("C:\\Users\\giuli\\UCM\\TFM\\census_2000_2019.csv")
gdp = pd.read_csv("C:\\Users\\giuli\\UCM\\TFM\\GDP.csv")
inflation = pd.read_csv("C:\\Users\\giuli\\UCM\\TFM\\inflation_USA_2000_2019.csv")
unemployment = pd.read_csv("C:\\Users\\giuli\\UCM\\TFM\\local-area-unemployment-statistics-beginning-1976-1.csv")
Area = pd.read_csv("C:\\Users\\giuli\\UCM\\TFM\\Area.csv")
PREPROCESADO
PREPROCESADO DE CENSUS
El dataset ‘Census’ cuenta con 12 variables y 982173 observaciones.
De una primera observación se nota que la variable ‘Total Wage’ tiene valores nulos.
En primer lugar, para facilitar la gestión de los datos se divide por 1.000.000 el valor de ‘Total Wage’ y se almacena en una nueva variable ‘Wage_million$’.
Se crea una nueva variable ‘Employment’ donde se almacena la suma de la población empleada en cada trimestre.
Se necesita crear una referencia temporal trimestral de tipo datetime. Para lograrlo, se remplazan los valores 1, 2, 3, 4 de la variable ‘Quarter’ por 31-03, 30-06, 30-09, 31-12 y al final de cada fecha se agrega la información del año correspondiente, procedente de la variable ‘Year’. Al final se transforma el objeto string en objeto datetime, con su formato correspondiente.
En cuanto a la variable ‘Área’ se decide quedar con la unidad administrativa más pequeña, o sea el condado. Se eliminan todas las observaciones referentes a unidades administrativas diferentes. Por dicha razón, se excluyen de la columna ‘Área’ todas las observaciones que no contienen la palabra ‘county’.
La variable ‘NAICS Title’ deja ver que hay una agrupación de datos ‘Total, All Industries’. Esta agrupación podría distorsionar el análisis, así que se eliminan todas las observaciones a las cuales les corresponde este valor.
En fin, se eliminan las columnas superfluas ‘Total Wage’, ‘index’, ‘Month 1 Employment’, ‘Month 2 Employment’, ‘Month 3 Employment’, ‘Area Type’.
census.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 982173 entries, 0 to 982172 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 982173 non-null int64 1 Area Type 982173 non-null object 2 Area 982173 non-null object 3 Year 982173 non-null int64 4 Quarter 982173 non-null int64 5 NAICS 982173 non-null int64 6 NAICS Title 982173 non-null object 7 Establishments 982173 non-null int64 8 Month 1 Employment 982173 non-null int64 9 Month 2 Employment 982173 non-null int64 10 Month 3 Employment 982173 non-null int64 11 Total Wage 982145 non-null float64 dtypes: float64(1), int64(8), object(3) memory usage: 89.9+ MB
census.head()
| index | Area Type | Area | Year | Quarter | NAICS | NAICS Title | Establishments | Month 1 Employment | Month 2 Employment | Month 3 Employment | Total Wage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | State | New York State | 2019 | 1 | 0 | Total, All Industries | 640172 | 9341688 | 9392837 | 9446466 | 2.001011e+11 |
| 1 | 1 | County | Albany County | 2019 | 1 | 0 | Total, All Industries | 10177 | 231193 | 231993 | 232076 | 3.439674e+09 |
| 2 | 2 | County | Allegany County | 2019 | 1 | 0 | Total, All Industries | 912 | 12748 | 13230 | 13254 | 1.263470e+08 |
| 3 | 3 | County | Bronx County | 2019 | 1 | 0 | Total, All Industries | 18622 | 317542 | 319956 | 322237 | 4.528452e+09 |
| 4 | 4 | County | Broome County | 2019 | 1 | 0 | Total, All Industries | 4381 | 83922 | 85202 | 85620 | 9.918101e+08 |
census.columns = census.columns.str.strip()
census["Wage_million$"] = census["Total Wage"]/1000000
census['Employment'] = census['Month 1 Employment']+census['Month 2 Employment']+census['Month 3 Employment']
census["Quarter"] = census["Quarter"].astype('string')
census["Year"] = census["Year"].astype('string')
census['Quarter'] = census['Quarter'].replace(['1'], ['31-03-'])
census['Quarter'] = census['Quarter'].replace(['2'], ['30-06-'])
census['Quarter'] = census['Quarter'].replace(['3'], ['30-09-'])
census['Quarter'] = census['Quarter'].replace(['4'], ['31-12-'])
census['Quarter'] = census['Quarter']+(census['Year'])
census['Quarter'] = pd.to_datetime(census['Quarter'], format='%d-%m-%Y',errors='raise')
census= census.drop(census[census['Area Type'] != 'County'].index)
census['County']=census['Area']
census = census.loc[~census['NAICS Title'].str.contains('Total', case=False)]
census = census.drop(['Total Wage'], axis=1)
census = census.drop(['index'], axis=1)
census = census.drop(["Month 1 Employment"], axis=1)
census = census.drop(["Month 2 Employment"], axis=1)
census = census.drop(["Month 3 Employment"], axis=1)
census=census.drop(['Area Type'], axis=1)
census=census.drop(['NAICS'], axis=1)
census.head()
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | |
|---|---|---|---|---|---|---|---|---|
| 142 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County |
| 240 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County |
| 269 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County |
| 270 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County |
| 271 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County |
PREPROCESADO DE GDP
El dataset ‘gdp’ contiene 2 variables y 92 observaciones.
Para obtener la referencia temporal trimestral que sucesivamente se necesitará para incorporar este dataset al conjunto de datos principal, se transforma la columna ‘Year’ dejando solo los 11 últimos dígitos y se pasa a formato datetime.
gdp.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 92 entries, 0 to 91 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Years 92 non-null object 1 GDP_trillion$ 92 non-null int64 dtypes: int64(1), object(1) memory usage: 1.6+ KB
gdp.head()
| Years | GDP_trillion$ | |
|---|---|---|
| 0 | 2022-10-01/2022-12-31 | 20984356 |
| 1 | 2022-07-01/2022-09-30 | 20705577 |
| 2 | 2022-04-01/2022-06-30 | 20349123 |
| 3 | 2022-01-01/2022-03-31 | 20088134 |
| 4 | 2021-10-01/2021-12-31 | 19866796 |
gdp['Quarter'] = gdp['Years'].str.slice(11)
gdp = gdp.drop(['Years'], axis=1)
gdp['Quarter'] = pd.to_datetime(gdp['Quarter'], format='%Y-%m-%d')
gdp.head()
| GDP_trillion$ | Quarter | |
|---|---|---|
| 0 | 20984356 | 2022-12-31 |
| 1 | 20705577 | 2022-09-30 |
| 2 | 20349123 | 2022-06-30 |
| 3 | 20088134 | 2022-03-31 |
| 4 | 19866796 | 2021-12-31 |
PREPROCESADO DE INFLATION
El dataset ‘Inflation’ está compuesto por 3 variables y 80 observaciones.
Se necesita crear una columna con la referencia temporal trimestral para luego poder agregar este dataset al conjunto de datos principal. Por dicha razón, se reemplazan los valores 1, 2, 3, 4 correspondientes a los trimestres por 31-03, 30-06, 30-09, 31-12 y se le añade el año correspondiente. Todo se convierte a datetime especificando el formato correspondiente.
inflation.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 80 entries, 0 to 79 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 80 non-null int64 1 Quarter 80 non-null int64 2 Inflation % 80 non-null float64 dtypes: float64(1), int64(2) memory usage: 2.0 KB
inflation.head()
| Year | Quarter | Inflation % | |
|---|---|---|---|
| 0 | 2000 | 1 | 3.4 |
| 1 | 2000 | 2 | 1.8 |
| 2 | 2000 | 3 | 2.5 |
| 3 | 2000 | 4 | 2.2 |
| 4 | 2001 | 1 | 2.7 |
inflation["Quarter"] = inflation["Quarter"].astype('string')
inflation["Year"] = inflation["Year"].astype('string')
inflation['Quarter'] = inflation['Quarter'].replace({'1': '31-03-', '2': '30-06-', '3': '30-09-', '4': '31-12-'})
inflation['Quarter']=inflation['Quarter']+inflation['Year']
inflation = inflation.drop(['Year'], axis=1)
inflation['Quarter'] = pd.to_datetime(inflation['Quarter'], format='%d-%m-%Y')
inflation.head()
| Quarter | Inflation % | |
|---|---|---|
| 0 | 2000-03-31 | 3.4 |
| 1 | 2000-06-30 | 1.8 |
| 2 | 2000-09-30 | 2.5 |
| 3 | 2000-12-31 | 2.2 |
| 4 | 2001-03-31 | 2.7 |
PREPROCESADO DE UNEPLOYEMENT
El dataset ‘Unemployment’ contiene 7 variables con 75712 observaciones cada una.
Para crear la referencia temporal trimestral, que luego permitirá incorporar este dataset al conjunto de datos principal, se aportan los mismos cambios ya ejecutados en los otros datasets. La variable ‘Unemployment Rate’ es mensual y debido a la necesidad de agrupar los datos trimestralmente para que el dataset sea compatible con el resto de datos, es necesario agruparla por trimestre y área, considerando los valores medios.
La variable ‘Área’ contiene algunas observaciones que se refieren a unidades administrativas diferentes de los condados, así que hay que filtrar estos valores para excluirlos del análisis y preparar el dataset para que sea compatible con el conjunto principal.
Se eliminan las variables sobrantes 'Month','Year','Labor Force','Employed','Unemployed'
unemployment.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 75712 entries, 0 to 75711 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Area 75712 non-null object 1 Year 75712 non-null int64 2 Month 75712 non-null int64 3 Labor Force 75712 non-null int64 4 Employed 75712 non-null int64 5 Unemployed 75712 non-null int64 6 Unemployment Rate 75712 non-null float64 dtypes: float64(1), int64(5), object(1) memory usage: 4.0+ MB
unemployment.head()
| Area | Year | Month | Labor Force | Employed | Unemployed | Unemployment Rate | |
|---|---|---|---|---|---|---|---|
| 0 | Albany City | 2018 | 10 | 48200 | 46400 | 1800 | 3.8 |
| 1 | Albany City | 2018 | 9 | 47600 | 45600 | 2000 | 4.1 |
| 2 | Albany City | 2018 | 8 | 47300 | 45100 | 2300 | 4.8 |
| 3 | Albany City | 2018 | 7 | 47800 | 45400 | 2400 | 4.9 |
| 4 | Albany City | 2018 | 6 | 48200 | 45800 | 2400 | 5.0 |
unemployment["Year"] = unemployment["Year"].astype('string')
unemployment["Month"] = unemployment["Month"].astype('string')
unemployment["Area"] = unemployment["Area"].astype('category')
unemployment['Quarter'] = unemployment['Month'].replace({'1': '31-03', '2': '31-03', '3': '31-03','4': '30-06', '5': '30-06', '6': '30-06','7': '30-09', '8': '30-09', '9': '30-09','10': '31-12', '11': '31-12', '12': '31-12'})
unemployment['Quarter'] = (unemployment["Quarter"] + ("-") + unemployment["Year"])
unemployment['Quarter']
0 31-12-2018
1 30-09-2018
2 30-09-2018
3 30-09-2018
4 30-06-2018
...
75707 30-06-1990
75708 30-06-1990
75709 31-03-1990
75710 31-03-1990
75711 31-03-1990
Name: Quarter, Length: 75712, dtype: string
unemployment['Quarter'] = pd.to_datetime(unemployment['Quarter'], format='%d-%m-%Y')
unemployment.columns = unemployment.columns.str.strip()
unemployment['Unemployment Rate'] = unemployment.groupby(['Quarter', 'Area'])['Unemployment Rate'].transform('mean')
unemployment = unemployment.loc[unemployment['Area'].str.contains('County', case=False)]
unemployment = unemployment.drop(['Month','Year','Labor Force','Employed','Unemployed'], axis=1)
unemployment= unemployment.drop_duplicates()
unemployment.head()
| Area | Unemployment Rate | Quarter | |
|---|---|---|---|
| 346 | Albany County | 3.100000 | 2018-12-31 |
| 347 | Albany County | 3.700000 | 2018-09-30 |
| 350 | Albany County | 3.833333 | 2018-06-30 |
| 353 | Albany County | 4.633333 | 2018-03-31 |
| 356 | Albany County | 4.066667 | 2017-12-31 |
unemployment.info()
<class 'pandas.core.frame.DataFrame'> Index: 7192 entries, 346 to 74671 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Area 7192 non-null category 1 Unemployment Rate 7192 non-null float64 2 Quarter 7192 non-null datetime64[ns] dtypes: category(1), datetime64[ns](1), float64(1) memory usage: 192.4 KB
PREPROCESADO DE AREA
El dataset ‘Área’ cuenta con 6 variables y 62 observaciones. Se observan valores nulos debidos a que algunas de las categorías de subdivisión administrativa se cumplimentan entre sí.
Area.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 62 entries, 0 to 61 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 County 62 non-null object 1 Metropolitan Statistical Area 29 non-null object 2 Micropolitan Statistical Area 14 non-null object 3 Metropolitan Division 10 non-null object 4 Workforce Investment Region 62 non-null object 5 Labor Market Area 62 non-null object dtypes: object(6) memory usage: 3.0+ KB
Area.head()
| County | Metropolitan Statistical Area | Micropolitan Statistical Area | Metropolitan Division | Workforce Investment Region | Labor Market Area | |
|---|---|---|---|---|---|---|
| 0 | Albany County | Albany-Schenectady-Troy Metro Area | NaN | NaN | Albany-Rensselaer-Schenectady Counties | Capital Region |
| 1 | Allegany County | NaN | NaN | NaN | Allegany-Cattaraugus | Western New York Region |
| 2 | Bronx County | NaN | NaN | New York City Region | New York City | New York City Region |
| 3 | Broome County | Binghamton Metro Area | NaN | NaN | Broome-Tioga | Southern Tier Region |
| 4 | Cattaraugus County | NaN | Olean, NY Micropolitan Statistical Area | NaN | Allegany-Cattaraugus | Western New York Region |
Area['LMA'] = Area['Labor Market Area'].astype('category')
Area['WIR'] = Area['Workforce Investment Region'].astype('category')
Area['County']=Area['County'].astype('category')
Area = Area.drop(['Labor Market Area','Workforce Investment Region','Metropolitan Statistical Area', 'Micropolitan Statistical Area','Metropolitan Division'], axis=1)
Area.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 62 entries, 0 to 61 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 County 62 non-null category 1 LMA 62 non-null category 2 WIR 62 non-null category dtypes: category(3) memory usage: 4.5 KB
UNIÓN DE DATASETS
Tras preprocesar los datos de todos los datasets, se va a preparar un único conjunto de datos.
‘census_1’ es el dataset generado por la unión de ‘census’ y ‘gdp’ tomando como referencia la variable ‘Quarter’. En el paso sucesivo, se le añade a este dataset la información contenida en ‘inflation’, tomando como referencia ‘Quarter’. El resultado de esta unión será el dataset ‘census_2’, al cual se le agrega el dataset ‘unemployment’, tomando como referencia ‘Quarter’ y ‘Area’. Este nuevo dataset se denomina ‘census_3’. En fin se crea el dataset ‘census_4’ que incluye el dataset anterior más la información del dataset ‘Area’, tomando como referencia la columna ‘County’.
El nuevo conjunto de datos ‘census_4’ tiene 13 variables y 472.807 observaciones.
Al parecer, las variables ‘Wage_million$’ y ‘Unemployment Rate’ tienen valores faltantes por gestionar.
Además, con el método describe aplicado al conjunto de variables numéricas, se observa que el valor mínimo de ‘Wage_million’ es 0. Esto podría ser un missing no declarado. Tras comprobar que son solo 9 observaciones en todo el dataset a llevar valor 0 en esta variable se decide eliminarlas, ya que tienen muy baja incidencia considerando la totalidad del conjunto de datos.
Antes de gestionar los datos faltantes, se eliminan del dataset las variables geográficas que no se van a incluir en el análisis (‘County’,’WIR’), ya que se escoge la variable 'LMA'.
census.head()
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | |
|---|---|---|---|---|---|---|---|---|
| 142 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County |
| 240 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County |
| 269 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County |
| 270 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County |
| 271 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County |
census_1= pd.merge(census, gdp, on=['Quarter'], how='left')
census_1
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472802 | Sullivan County | 2000 | 2000-12-31 | Unclassified | 21 | 0.191175 | 155 | Sullivan County | 8412000 |
| 472803 | Tompkins County | 2000 | 2000-12-31 | Unclassified | 10 | 0.104896 | 50 | Tompkins County | 8412000 |
| 472804 | Ulster County | 2000 | 2000-12-31 | Unclassified | 49 | 0.360702 | 298 | Ulster County | 8412000 |
| 472805 | Wayne County | 2000 | 2000-12-31 | Unclassified | 15 | 0.179872 | 52 | Wayne County | 8412000 |
| 472806 | Westchester County | 2000 | 2000-12-31 | Unclassified | 535 | 9.324520 | 3120 | Westchester County | 8412000 |
472807 rows × 9 columns
census_2= pd.merge(census_1, inflation, on=['Quarter'], how='left')
census_2
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | Inflation % | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 | 2.8 |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 | 0.9 |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 | 0.9 |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 | 0.9 |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 | 0.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472802 | Sullivan County | 2000 | 2000-12-31 | Unclassified | 21 | 0.191175 | 155 | Sullivan County | 8412000 | 2.2 |
| 472803 | Tompkins County | 2000 | 2000-12-31 | Unclassified | 10 | 0.104896 | 50 | Tompkins County | 8412000 | 2.2 |
| 472804 | Ulster County | 2000 | 2000-12-31 | Unclassified | 49 | 0.360702 | 298 | Ulster County | 8412000 | 2.2 |
| 472805 | Wayne County | 2000 | 2000-12-31 | Unclassified | 15 | 0.179872 | 52 | Wayne County | 8412000 | 2.2 |
| 472806 | Westchester County | 2000 | 2000-12-31 | Unclassified | 535 | 9.324520 | 3120 | Westchester County | 8412000 | 2.2 |
472807 rows × 10 columns
census_3 = pd.merge(census_2, unemployment, on=['Quarter','Area'], how='left')
census_3
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | Inflation % | Unemployment Rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 | 2.8 | 6.700000 |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 | 0.9 | NaN |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 | 0.9 | NaN |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 | 0.9 | NaN |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 | 0.9 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472802 | Sullivan County | 2000 | 2000-12-31 | Unclassified | 21 | 0.191175 | 155 | Sullivan County | 8412000 | 2.2 | 4.233333 |
| 472803 | Tompkins County | 2000 | 2000-12-31 | Unclassified | 10 | 0.104896 | 50 | Tompkins County | 8412000 | 2.2 | 2.900000 |
| 472804 | Ulster County | 2000 | 2000-12-31 | Unclassified | 49 | 0.360702 | 298 | Ulster County | 8412000 | 2.2 | 3.266667 |
| 472805 | Wayne County | 2000 | 2000-12-31 | Unclassified | 15 | 0.179872 | 52 | Wayne County | 8412000 | 2.2 | 3.466667 |
| 472806 | Westchester County | 2000 | 2000-12-31 | Unclassified | 535 | 9.324520 | 3120 | Westchester County | 8412000 | 2.2 | 3.100000 |
472807 rows × 11 columns
Area.head()
| County | LMA | WIR | |
|---|---|---|---|
| 0 | Albany County | Capital Region | Albany-Rensselaer-Schenectady Counties |
| 1 | Allegany County | Western New York Region | Allegany-Cattaraugus |
| 2 | Bronx County | New York City Region | New York City |
| 3 | Broome County | Southern Tier Region | Broome-Tioga |
| 4 | Cattaraugus County | Western New York Region | Allegany-Cattaraugus |
census_4 = pd.merge(census_3, Area, on=['County'], how='left')
census_4
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | Inflation % | Unemployment Rate | LMA | WIR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 | 2.8 | 6.700000 | Southern Tier Region | Broome-Tioga |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 | 0.9 | NaN | North Country Region | Clinton-Essex-Franklin-Hamilton |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 | 0.9 | NaN | Capital Region | Albany-Rensselaer-Schenectady Counties |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 | 0.9 | NaN | Western New York Region | Allegany-Cattaraugus |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 | 0.9 | NaN | New York City Region | New York City |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472802 | Sullivan County | 2000 | 2000-12-31 | Unclassified | 21 | 0.191175 | 155 | Sullivan County | 8412000 | 2.2 | 4.233333 | Hudson Valley Region | Sullivan |
| 472803 | Tompkins County | 2000 | 2000-12-31 | Unclassified | 10 | 0.104896 | 50 | Tompkins County | 8412000 | 2.2 | 2.900000 | Southern Tier Region | Tompkins |
| 472804 | Ulster County | 2000 | 2000-12-31 | Unclassified | 49 | 0.360702 | 298 | Ulster County | 8412000 | 2.2 | 3.266667 | Hudson Valley Region | Ulster |
| 472805 | Wayne County | 2000 | 2000-12-31 | Unclassified | 15 | 0.179872 | 52 | Wayne County | 8412000 | 2.2 | 3.466667 | Finger Lakes Region | Ontario-Seneca-Wayne-Yates |
| 472806 | Westchester County | 2000 | 2000-12-31 | Unclassified | 535 | 9.324520 | 3120 | Westchester County | 8412000 | 2.2 | 3.100000 | Hudson Valley Region | Putnam-Balance Of Westchester |
472807 rows × 13 columns
census_4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 472807 entries, 0 to 472806 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Area 472807 non-null object 1 Year 472807 non-null string 2 Quarter 472807 non-null datetime64[ns] 3 NAICS Title 472807 non-null object 4 Establishments 472807 non-null int64 5 Wage_million$ 472779 non-null float64 6 Employment 472807 non-null int64 7 County 472807 non-null object 8 GDP_trillion$ 472807 non-null int64 9 Inflation % 472807 non-null float64 10 Unemployment Rate 447126 non-null float64 11 LMA 472807 non-null category 12 WIR 472807 non-null category dtypes: category(2), datetime64[ns](1), float64(3), int64(3), object(3), string(1) memory usage: 40.6+ MB
census_4.head()
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | Inflation % | Unemployment Rate | LMA | WIR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 | 2.8 | 6.7 | Southern Tier Region | Broome-Tioga |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 | 0.9 | NaN | North Country Region | Clinton-Essex-Franklin-Hamilton |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 | 0.9 | NaN | Capital Region | Albany-Rensselaer-Schenectady Counties |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 | 0.9 | NaN | Western New York Region | Allegany-Cattaraugus |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 | 0.9 | NaN | New York City Region | New York City |
census_4.describe(include=np.number)
| Establishments | Wage_million$ | Employment | GDP_trillion$ | Inflation % | Unemployment Rate | |
|---|---|---|---|---|---|---|
| count | 472807.000000 | 472779.000000 | 4.728070e+05 | 4.728070e+05 | 472807.000000 | 447126.000000 |
| mean | 189.544520 | 40.733262 | 8.077940e+03 | 1.296661e+07 | 1.799564 | 5.787943 |
| std | 772.741135 | 430.958390 | 3.727073e+04 | 2.981696e+06 | 1.426124 | 1.843885 |
| min | 1.000000 | 0.000000 | 2.000000e+00 | 8.412000e+06 | -5.600000 | 2.433333 |
| 25% | 9.000000 | 0.678741 | 2.600000e+02 | 1.040547e+07 | 1.200000 | 4.366667 |
| 50% | 26.000000 | 2.573643 | 9.160000e+02 | 1.260281e+07 | 2.000000 | 5.300000 |
| 75% | 87.000000 | 11.443832 | 3.691000e+03 | 1.584316e+07 | 2.700000 | 7.033333 |
| max | 20551.000000 | 48118.821276 | 1.227740e+06 | 1.789074e+07 | 4.500000 | 14.133333 |
(census_4['Wage_million$'] == 0).sum()
9
census_4 = census_4[census_4['Wage_million$'] != 0]
census_4.head()
| Area | Year | Quarter | NAICS Title | Establishments | Wage_million$ | Employment | County | GDP_trillion$ | Inflation % | Unemployment Rate | LMA | WIR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broome County | 2018 | 2018-03-31 | Florists | 6 | 0.184196 | 121 | Broome County | 16627803 | 2.8 | 6.7 | Southern Tier Region | Broome-Tioga |
| 1 | Essex County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 30 | 0.667383 | 291 | Essex County | 17529646 | 0.9 | NaN | North Country Region | Clinton-Essex-Franklin-Hamilton |
| 2 | Albany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 1.971956 | 591 | Albany County | 17529646 | 0.9 | NaN | Capital Region | Albany-Rensselaer-Schenectady Counties |
| 3 | Allegany County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 18 | 0.930796 | 358 | Allegany County | 17529646 | 0.9 | NaN | Western New York Region | Allegany-Cattaraugus |
| 4 | Bronx County | 2019 | 2019-03-31 | Agriculture, Forestry, Fishing and Hunting | 4 | 0.213331 | 65 | Bronx County | 17529646 | 0.9 | NaN | New York City Region | New York City |
census_4.columns = census_4.columns.str.strip()
census_4 = census_4.drop('County', axis=1)
census_4 = census_4.drop('WIR', axis=1)
census_4 = census_4.drop(['Area'], axis=1)
census_4=census_4.reset_index()
census_4["NAICS Title"] = census_4["NAICS Title"].astype('category')
census_4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 472798 entries, 0 to 472797 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 472798 non-null int64 1 Year 472798 non-null string 2 Quarter 472798 non-null datetime64[ns] 3 NAICS Title 472798 non-null category 4 Establishments 472798 non-null int64 5 Wage_million$ 472770 non-null float64 6 Employment 472798 non-null int64 7 GDP_trillion$ 472798 non-null int64 8 Inflation % 472798 non-null float64 9 Unemployment Rate 447117 non-null float64 10 LMA 472798 non-null category dtypes: category(2), datetime64[ns](1), float64(3), int64(4), string(1) memory usage: 33.8 MB
ANÁLISIS Y GESTIÓN DE LOS NULOS
La variable ‘NAICS Title’ presenta una categoria de ‘Unclassified’, este valor se pasa a nulo.
Se observan cuántos valores nulos hay en cada variable y su proporción con respecto al total de los datos.
En general hay pocos valores nulos. Los nulos de la variable ‘Wage_million’ son menos del 0,001%, los de ‘NAICS Title’ son aproximadamente un 1% y los de la variable ‘Unemployment Rate’ son casi un 6% del total de los datos.
Los valores nulos no tienen correlación entre sí.
Se eliminan todas las observaciones que contienen valores nulos en las variables ‘Wage_million$’ y ‘NAICS Title’ debido a su baja incidencia con respecto al total de los datos. Mientras que para la variable ‘Unemployment Rate’, los valores nulos se estiman con la media de los otros valores.
census_4['NAICS Title'] = census_4['NAICS Title'].replace('Unclassified', pd.NA)
census_4.isnull().sum()
index 0 Year 0 Quarter 0 NAICS Title 5303 Establishments 0 Wage_million$ 28 Employment 0 GDP_trillion$ 0 Inflation % 0 Unemployment Rate 25681 LMA 0 dtype: int64
census_4.apply(lambda x: x.isna().sum()/x.count()*100)
index 0.000000 Year 0.000000 Quarter 0.000000 NAICS Title 1.134344 Establishments 0.000000 Wage_million$ 0.005923 Employment 0.000000 GDP_trillion$ 0.000000 Inflation % 0.000000 Unemployment Rate 5.743687 LMA 0.000000 dtype: float64
msno.matrix(census_4.sort_values(by='Unemployment Rate'))
<Axes: >
census_4 = census_4.dropna(subset=['NAICS Title'])
census_4 = census_4.dropna(subset=['Wage_million$'])
media = census_4['Unemployment Rate'].mean()
census_4['Unemployment Rate'].fillna(media, inplace=True)
census_4.info()
<class 'pandas.core.frame.DataFrame'> Index: 467467 entries, 0 to 472753 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 467467 non-null int64 1 Year 467467 non-null string 2 Quarter 467467 non-null datetime64[ns] 3 NAICS Title 467467 non-null category 4 Establishments 467467 non-null int64 5 Wage_million$ 467467 non-null float64 6 Employment 467467 non-null int64 7 GDP_trillion$ 467467 non-null int64 8 Inflation % 467467 non-null float64 9 Unemployment Rate 467467 non-null float64 10 LMA 467467 non-null category dtypes: category(2), datetime64[ns](1), float64(3), int64(4), string(1) memory usage: 37.0 MB
ANÁLISIS Y GESTIÓN DE OUTLIERS
Para determinar la presencia de outliers en las variables numéricas, en primer lugar, se estudia el coeficiente de asimetría.
Las variables ‘Establishments’, ‘Wage_million’ y ‘Employment’ tienen un coeficiente de asimetría positivo bastante alto, que indica una cola a la derecha de la distribución. También la variable ‘Unemployment’ tiene coeficiente de asimetría positivo pero de forma más contenida. En cambio, la variable ‘GDP_trillion$’ tiene un coeficiente de asimetría cercano a 0 que indica una distribución más o menos simétrica y la variable ‘Inflation %’ tiene un coeficiente de asimetría negativo que indica una cola izquierda en la distribución.
Para gestionar los outliers es necesario separar la variable objetivo del dataset.
Se define la función winzorize with pandas, que tiene como parametro la serie a winsorizar y los percentiles a los cuales ajustar los outliers.
Esta función se incorpora en la función gestiona outliers que incluye las opciones de controlar la incidencia de outliers en el total de datos, winsorizar o convertir los outliers a nulos. En este caso se opta por la opción winsor. Una vez gestionados los outliers, se vuelve a unir el resultado con las variables categóricas y la variable objetivo.
y = census_4['Wage_million$']
census_5 = census_4.drop(['Wage_million$'], axis=1)
census_5.select_dtypes(include=np.number).apply(lambda x: x.skew())
index 0.000977 Establishments 12.015200 Employment 15.560795 GDP_trillion$ 0.047392 Inflation % -1.710759 Unemployment Rate 0.918534 dtype: float64
def winsorize_with_pandas(s, limits):
return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
upper=s.quantile(1-limits[1], interpolation='higher'))
def gestiona_outliers(col,clas='winsor'):
print(col.name)
if abs(col.skew()) < 1:
criterio1 = abs((col-col.mean())/col.std())>3
else:
criterio1 = abs((col-col.median()))>8
q1 = col.quantile(0.05)
q3 = col.quantile(0.95)
IQR=q3-q1
criterio2 = (col<(q1 - 3*IQR))|(col>(q3 + 3*IQR))
lower = col[criterio1&criterio2&(col<q1)].count()/col.dropna().count()
upper = col[criterio1&criterio2&(col>q3)].count()/col.dropna().count()
if clas == 'check':
return(lower*100,upper*100,(lower+upper)*100)
elif clas == 'winsor':
return(winsorize_with_pandas(col,(lower,upper)))
elif clas == 'miss':
print('\n MissingAntes: ' + str(col.isna().sum()))
col.loc[criterio1&criterio2] = np.nan
print('MissingDespues: ' + str(col.isna().sum()) +'\n')
return(col)
census_5_winsor = census_5.select_dtypes(include=np.number).copy().apply(lambda x: gestiona_outliers(x))
index Establishments Employment GDP_trillion$ Inflation % Unemployment Rate
census_5_winsor.describe()
| index | Establishments | Employment | GDP_trillion$ | Inflation % | Unemployment Rate | |
|---|---|---|---|---|---|---|
| count | 467467.000000 | 467467.000000 | 467467.000000 | 4.674670e+05 | 467467.000000 | 467467.000000 |
| mean | 236360.995936 | 153.861612 | 6628.912811 | 1.296687e+07 | 1.799484 | 5.787414 |
| std | 136477.703566 | 428.528981 | 18385.722831 | 2.981232e+06 | 1.426027 | 1.792872 |
| min | 0.000000 | 1.000000 | 2.000000 | 8.412000e+06 | -5.600000 | 2.433333 |
| 25% | 118150.500000 | 9.000000 | 267.000000 | 1.040547e+07 | 1.200000 | 4.433333 |
| 50% | 236254.000000 | 26.000000 | 931.000000 | 1.260281e+07 | 2.000000 | 5.433333 |
| 75% | 354559.500000 | 87.000000 | 3736.000000 | 1.584316e+07 | 2.700000 | 6.900000 |
| max | 472762.000000 | 2892.000000 | 132014.000000 | 1.789074e+07 | 4.500000 | 14.133333 |
census_6 = census_5_winsor.join(census_5.select_dtypes(exclude=np.number))
census_6 = census_6.join(y)
var_cont= census_6.select_dtypes(include=np.number)
var_cat= census_6.select_dtypes(exclude=np.number)
INSPECCIÓN GRÁFICA DE LAS VARIABLES CATEGÓRICAS
Observando el histograma de las variables categóricas se nota que en los años 2017 y 2018 hay más observaciones que en los otros años de la serie temporal. Se buscan eventuales valores duplicados y se eliminan.
En cuanto a las variables ‘LMA’ y ‘NAICS Title’ no se observan anomalías en los datos.
for col in var_cat.columns:
var_cat[col] = var_cat[col].astype(str)
fig = px.histogram(var_cat, x=col, title=f'Distribución de {col}')
fig.show()